Let’s start by loading the materials we’ll need:
trilogy datasets

The main package we’ll use is the tidyverse, which is actually a collection of R packages with a consistent design philosophy, grammar, and data structures.
To pull the current versions of the datasets, we’ll follow the steps outlined in the Getting Started > Use in Reproducible Research vignette. That’s why you’ll see a long, alphanumeric code in the links below, specifying precisely what version of the data is being used.
# read_csv() and the pipe used throughout come from the tidyverse
library(tidyverse)
tmi <- read_csv("https://raw.githubusercontent.com/j-hagedorn/trilogy/d57c2cefd0b216c8ce5c251f618c3e931c732d0a/data/tmi.csv")
atu_df <- read_csv("https://raw.githubusercontent.com/j-hagedorn/trilogy/d57c2cefd0b216c8ce5c251f618c3e931c732d0a/data/atu_df.csv")
atu_seq <- read_csv("https://raw.githubusercontent.com/j-hagedorn/trilogy/d57c2cefd0b216c8ce5c251f618c3e931c732d0a/data/atu_seq.csv")
aft <- read_csv("https://raw.githubusercontent.com/j-hagedorn/trilogy/d57c2cefd0b216c8ce5c251f618c3e931c732d0a/data/aft.csv")
This allows us to explicitly reference a version of the data so that
any research we do can be precisely replicated by others. For instance,
if you wanted to pull an old (and not yet cleaned-up) version of the
aft dataset, you’d just need to go back in the GitHub
history and run the following:
old_aft <- read_csv("https://raw.githubusercontent.com/j-hagedorn/trilogy/f0fb12d108734847114f17980b05686a26305e38/data/aat.csv")
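The pinned links above all follow the same pattern: repository, commit hash, then the path to the file. A minimal sketch of that pattern is below. The helper name `trilogy_url` and the variable `pinned_sha` are made up for illustration and are not part of the trilogy repository.

```r
# Hypothetical helper (an assumption, not provided by the trilogy project):
# build a raw GitHub URL for one data file at one specific commit.
trilogy_url <- function(file, sha) {
  paste0(
    "https://raw.githubusercontent.com/j-hagedorn/trilogy/",
    sha, "/data/", file
  )
}

# The commit hash used in the links above
pinned_sha <- "d57c2cefd0b216c8ce5c251f618c3e931c732d0a"

tmi_url <- trilogy_url("tmi.csv", pinned_sha)
aft_url <- trilogy_url("aft.csv", pinned_sha)

# The result can be passed to read_csv(), e.g. tmi <- readr::read_csv(tmi_url)
```

Keeping the hash in a single variable means that every dataset in an analysis is read from the same snapshot, and moving to a newer snapshot is a one-line change.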
The table below shows the distinct count of motifs and tale types in Trilogy datasets.
x <-
atu_seq %>%
# Get all tale IDs and motif IDs within them
select(motif,atu_id) %>%
mutate(in_atu_seq = T) %>%
# Pull in all tale IDs present in annotated tales.
# Note that counting motifs in annotated tales assumes
# that each tale has each motif in the canonical sequence.
left_join(
aft %>% select(atu_id) %>% mutate(in_aft = T),
by = "atu_id"
) %>%
distinct(motif, in_atu_seq, in_aft) %>%
mutate(in_aft = if_else(is.na(in_aft),F,T)) %>%
# Don't keep multiple rows per motif, count as present (i.e. 'TRUE')
group_by(motif, in_atu_seq) %>%
filter(in_aft == max(in_aft))
m <-
tmi %>%
select(motif = id) %>%
distinct(motif) %>%
mutate(in_tmi = T) %>%
full_join(x, by = "motif") %>%
mutate(
in_tmi = if_else(is.na(in_tmi),F,in_tmi),
in_atu_seq = if_else(is.na(in_atu_seq),F,in_atu_seq),
in_aft = if_else(is.na(in_aft),F,in_aft)
)
m_sum <-
m %>%
summarise(
in_tmi = sum(in_tmi),
in_atu_seq = sum(in_atu_seq),
in_aft = sum(in_aft)
) %>%
mutate(unit = "motifs")
t <-
atu_df %>%
distinct(atu_id) %>%
mutate(in_atu_df = T) %>%
left_join(
atu_seq %>%
distinct(atu_id) %>%
mutate(in_atu_seq = T),
by = "atu_id"
) %>%
left_join(
aft %>%
distinct(atu_id) %>%
mutate(in_aft = T),
by = "atu_id"
) %>%
mutate(
in_atu_seq = if_else(is.na(in_atu_seq),F,in_atu_seq),
in_aft = if_else(is.na(in_aft),F,in_aft)
)
t_sum <-
t %>%
summarise(
in_atu_df = sum(in_atu_df),
in_atu_seq = sum(in_atu_seq),
in_aft = sum(in_aft)
) %>%
mutate(unit = "tale types")
m_sum %>%
bind_rows(t_sum) %>%
select(
unit, tmi = in_tmi,
atu_df = in_atu_df, atu_seq = in_atu_seq, aft = in_aft
) %>%
rmarkdown::paged_table()
rm(m); rm(m_sum); rm(t); rm(t_sum)
Note that motifs do not exist in atu_df or
atu_combos, because those datasets contain one row per
atu_id. Similarly, tale types (atu_ids) do not
exist in tmi, because that dataset contains one row per
motif.
The Trilogy’s datasets are linked by two key identifiers:
motifs (i.e. motif_id) and tale types
(i.e. atu_id). Understanding how these datasets overlap,
and what proportion of the available motifs and tale types each one
contains, is necessary for using them successfully.
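To make the idea of overlap concrete, here is a small sketch using base R set operations on invented identifiers. The vectors `toy_tmi` and `toy_atu_seq` are assumptions made up for this example and do not reflect the real data or its counts.

```r
# Toy identifiers, invented for illustration only (not real counts)
toy_tmi     <- c("A1", "A2", "B10", "B11", "C5")
toy_atu_seq <- c("A2", "B10", "Z99")  # "Z99" stands in for an unmatched ID

# Motifs present in both sets
shared <- intersect(toy_tmi, toy_atu_seq)

# Motifs in the index that no tale type uses
unused <- setdiff(toy_tmi, toy_atu_seq)

# Motif IDs cited by tale types but missing from the index
unmatched <- setdiff(toy_atu_seq, toy_tmi)

# Share of the index that tale types make use of, in percent
pct_used <- round(100 * length(shared) / length(toy_tmi), 1)
```

The same three questions (what is shared, what is unused, what is unmatched) are the ones the joins and plots in this section answer for the full datasets.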
Below is an upset plot showing the intersection of discrete motifs across the various datasets which comprise the Trilogy.
library(UpSetR)
motif_lists <-
list(
tmi = tmi %>% distinct(id) %>% .$id,
atu_seq = atu_seq %>% distinct(motif) %>% .$motif,
aft = x %>% filter(in_aft) %>% .$motif
)
upset(
fromList(motif_lists),
order.by = "freq",
mainbar.y.label = "Intersection Size",
sets.x.label = "Motifs per Dataset"
)
rm(x); rm(motif_lists)
unjoined_motifs <-
atu_seq %>%
select(motif) %>%
distinct() %>%
anti_join(tmi, by = c("motif" = "id"))
Observations:
- Most motifs in the tmi are not present in the ATU (i.e. atu_seq). Specifically, 42457 of the 46222 motifs in the tmi (91.9%) are not present in the ATU. This means that the tale types from the ATU make use of only 8.1% of the available motifs from the TMI.
- Of the motifs in the ATU (i.e. atu_seq), most (n = 2969/3799) do not have corresponding annotated texts in the aft.
- Only a small set of motifs is represented in the aft corpus. This is a minuscule 1.8% of the total available motifs in the tmi. Fortunately, it can be increased with time and dedication.
- Some motifs are present in atu_seq, but not in the tmi. In 4 of these instances, there is one or more corresponding tale text in the aft which contains a non-tmi motif.1

Below is an upset plot showing the intersection of discrete tale types across the various datasets which comprise the Trilogy.
type_lists <-
list(
atu_df = atu_df %>% distinct(atu_id) %>% .$atu_id,
atu_seq = atu_seq %>% distinct(atu_id) %>% .$atu_id,
aft = aft %>% distinct(atu_id) %>% .$atu_id
)
upset(
fromList(type_lists),
order.by = "freq",
mainbar.y.label = "Intersection Size",
sets.x.label = "Tale Types per Dataset"
)
rm(unjoined_motifs); rm(type_lists)
Observations:
- Not all atu_ids in atu_df are in atu_seq, because some of the tale summaries from atu_df do not reference any motif IDs.2 Specifically, 597 tale types in the ATU do not have distinct motif IDs identified.
- An atu_id can be present in atu_df and in aft, but not be included in atu_seq, which means the text version of the tale cannot be referenced against an available list of motifs. Fortunately, this only applies to 8 tale types.
- Some tale types are present in all three datasets: atu_df, atu_seq, and aft.
- Many tale types (in atu_df and atu_seq) are not present in the aft. This underscores the need for more annotated texts.

The tmi comprises 46222 distinct motifs.3 It is
grouped into 23 ‘chapters’, including: Myths, Animals, Tabu, Magic,
Death, Marvels, Ogres, Tests, Wisdom and Folly, Deceptions, Reversals of
Fortune, Ordaining the Future, Chance and Fate, Society, Rewards and
Punishments, Captives and Fugitives, Cruelty, Sex, Nature of Life,
Religion, Traits of Character, Humor, Miscellaneous. Beneath the
chapter level are nested levels of groups, named as follows:
- level_0 = What Thompson labelled ‘Grand divisions’, sections divisible by 100.
- level_1 = Smaller divisions ending with ‘0’, defined at intervals of 10.
- level_2-level_6 = Various sections with multiple layers of subdivisions.

The most populated level of the index (i.e. ‘3’) is that of the initial subdivision, indicating that there are frequently no splits made in the motif identified. While the index structure would allow for each subsequent level (i.e. levels 4 - 6+) to have increasing numbers of more finely grained motifs, these either do not exist or have not been filled in.
tmi %>%
ggplot(aes(x = level)) +
# geom_bar() counts rows per level, as geom_histogram(stat = "count") did
geom_bar() +
theme_minimal() +
theme(plot.title.position = "plot") +
labs(
title = "Most motifs are in the subdivisions",
subtitle = "From chapters (level 0) through subdivisions (levels 3-6)",
x = "Depth within index",
y = "Distinct motif entries"
)
The excerpt below shows how this hierarchy structure is represented in the ‘flat’ dataset:
tmi %>%
filter(level_2 == "B122") %>%
select(chapter_id,id,motif_name,level,starts_with("level_")) %>%
select(-level_5,-level_6) %>%
arrange(id) %>%
rmarkdown::paged_table()
Note that some motifs do not fit into the hierarchical level format
(i.e. level = NA). This occurs when there is a zero
indicator at one of the decimal indices, since this creates a break in
the hierarchical structure. For instance, in the “B122” section, we find
B122.1 as a parent motif for B122.1.1-2, but there is no B122.0 to serve
as a parent for B122.0.1.
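The break described above can be sketched in a few lines. This assumes that a motif’s parent is simply its ID with the last dot-separated segment removed. That rule, the helper `parent_id`, and the hand-typed vector `ids` are illustrative assumptions, not the code used to build the tmi dataset.

```r
# Assumption: the parent of a motif is its ID minus the final ".segment"
# (e.g. "B122.1.1" -> "B122.1"). Illustration only.
parent_id <- function(id) sub("\\.[^.]+$", "", id)

# A hand-typed subset of IDs from the B122 section, for illustration
ids <- c("B122", "B122.0.1", "B122.1", "B122.1.1", "B122.1.2")

parents <- parent_id(ids)

# TRUE where the implied parent is itself an entry in the subset.
# "B122" has no dot, so sub() returns it unchanged and it matches itself.
has_parent <- parents %in% ids

# Motifs whose implied parent does not exist
orphans <- ids[!has_parent]
```

Here `orphans` contains only B122.0.1, because its implied parent B122.0 is not an entry, which is the kind of gap that leaves level as NA.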
The table below shows how many motifs (i.e. n_motifs)
are in each chapter, and each level (if you expand the
row).
library(reactable)
tmi %>%
group_by(chapter_name,level) %>%
summarize(n_motifs = n_distinct(id)) %>%
reactable(
groupBy = c("chapter_name"),
columns = list(
level = colDef(
aggregate = "count",
format = list(
aggregated = colFormat(suffix = " levels")
)
),
n_motifs = colDef(aggregate = "sum")
)
)
summary_df <-
tmi %>%
group_by(chapter_id,chapter_name,level_0,level_1,level_2,level_3) %>%
summarize(n = n()) %>%
ungroup() %>%
left_join(tmi %>% select(id,level_0_name = motif_name), by = c("level_0" = "id")) %>%
left_join(tmi %>% select(id,level_1_name = motif_name), by = c("level_1" = "id")) %>%
left_join(tmi %>% select(id,level_2_name = motif_name), by = c("level_2" = "id")) %>%
left_join(tmi %>% select(id,level_3_name = motif_name), by = c("level_3" = "id")) %>%
select(
starts_with("chapter_"),starts_with("level_0"),
starts_with("level_1"),starts_with("level_2"),starts_with("level_3"),n
)
Tale types are derived from the Aarne–Thompson–Uther Index (ATU). It
is represented in the Trilogy by three distinct datasets:
atu_df, atu_seq, and atu_combos.
Additional documentation regarding each of these can be found in the data
dictionary.
The atu_df is comprised of 2247 distinct tale types,
each with a formal identifier (atu_id). It is grouped into
7 chapters, which are broken into sub-sections, or divisions. There are
42 divisions in the index. You can click on the treemap below to explore
each chapter and its divisions and subdivisions:
stack <-
atu_df %>%
group_by(chapter) %>%
summarize(n = n()) %>% ungroup() %>%
mutate(parent = "") %>% rename(child = chapter) %>% select(parent,child,n) %>%
bind_rows(
atu_df %>% select(parent = chapter, child = division) %>%
group_by(parent,child) %>% summarize(n = n()) %>% ungroup()
) %>%
bind_rows(
atu_df %>% select(parent = division, child = sub_division) %>%
group_by(parent,child) %>% summarize(n = n()) %>% ungroup()
) %>%
ungroup() %>%
filter(!is.na(child)) %>%
mutate(id = str_c("r_",row_number()))
plotly::plot_ly(
type = "treemap", # "sunburst"
branchvalues = "total",
labels = stack$child,
values = stack$n,
parents = stack$parent
)
rm(stack)
summary_atu_seq <-
atu_seq %>%
group_by(atu_id) %>%
summarize(
variants = max(tale_variant),
n_motifs = n_distinct(motif)
) %>%
ungroup()
The atu_seq dataset has one row for each occurrence of a
TMI motif within a tale type from the ATU index. It was produced by
pulling motif IDs from the tale_type description from the
atu_df dataset. There are an average of 41.6 tale
variants4
and 2.8 motifs per tale type, with the distributions shown below:
library(patchwork)
p1 <-
summary_atu_seq %>%
slice_min(order_by = variants, prop = 0.99) %>%
ggplot(aes(x = variants)) +
# geom_bar() counts tale types per variant count
geom_bar() +
theme_minimal()
p2 <-
summary_atu_seq %>%
ggplot(aes(x = n_motifs)) +
# geom_bar() counts tale types per motif count
geom_bar() +
theme_minimal()
p <- p1 / p2
p + plot_annotation(
title = 'How many stories? How many parts?',
subtitle = 'Distribution of variant and motif counts within the ATU',
caption = 'NB: Outliers removed from variants'
)
rm(p); rm(p1); rm(p2)
For some tales, multiple combinations of motifs are noted as possible permutations of the tale (for example, ATU 605A is a story in which “A young man, born of an animal… or from a giant… [B631, F611.1.1, F611.1.11-F611.1.15, T516] develops great strength (at the forge, in the forest, in war, by suckling for many years [F611.2.1, F611.2.3]…)”). In these instances, all of the possible permutations are listed as specific variants of the tale type. When ranges of motifs are referenced (e.g. F611.1.11-F611.1.15, above) all motifs within that range are included and provided with different variants.5
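A rough sketch of how ranges and alternative motifs multiply into variants is shown below, using the ATU 605A excerpt. It assumes that a range differs only in its final numeric segment and that each bracketed group is a slot from which a variant picks exactly one motif. The helper `expand_motif_range` and the `slots` list are hypothetical and may differ from the method actually used to build atu_seq.

```r
# Assumption: a range such as F611.1.11-F611.1.15 differs only in its final
# numeric segment, so it can be expanded by counting from 11 to 15.
expand_motif_range <- function(from, to) {
  prefix <- sub("[0-9]+$", "", from)             # everything before the final number
  first  <- as.integer(sub("^.*\\.", "", from))  # final number of the start ID
  last   <- as.integer(sub("^.*\\.", "", to))    # final number of the end ID
  paste0(prefix, seq(first, last))
}

# Assumption: each bracketed group in the summary is one slot, and a variant
# takes exactly one motif from each slot.
slots <- list(
  origin   = c("B631", "F611.1.1",
               expand_motif_range("F611.1.11", "F611.1.15"), "T516"),
  strength = c("F611.2.1", "F611.2.3")
)

# One row per combination of choices, i.e. one row per tale variant
variants <- expand.grid(slots, stringsAsFactors = FALSE)
n_variants <- nrow(variants)  # 8 origin options x 2 strength options
```

Because the number of variants is the product of the slot sizes, a summary with many bracketed groups or wide ranges can produce a very large number of variants.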
The table below shows the tale types with the top 1% of variants derived from the method described above:
summary_atu_seq %>%
slice_max(order_by = variants, prop = 0.01) %>%
rmarkdown::paged_table()
Here, we see that 61884 (i.e. 46332 + 15552) tale variants out
of the total 68245 are due to 2 extreme outliers caused by a
combinatorial explosion. Without these included, there would be 6361
tale variants, derived from a total of 1642 tale types in the
atu_seq dataset.
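The arithmetic behind these figures can be checked directly. The numbers below are copied from the paragraph above, and the variable names are chosen only for this sketch.

```r
# Figures quoted in the text above
outlier_variants <- c(46332, 15552)
total_variants   <- 68245

# Variants contributed by the 2 outlier tale types
from_outliers <- sum(outlier_variants)

# Variants left once the outliers are set aside
remaining <- total_variants - from_outliers

# Fraction of all variants that comes from the outliers
share <- from_outliers / total_variants
```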
1. The unmatched motif IDs include: H1014, F111.7, X751, X761, K461.2.1, J115.4, J1744.4, C111.10, B478, X1122.3.1, X208.2, L452.1.7, T721.5, T101, D963, D590, F605.2, Z3, B581.1.2, D371.1, N552.1.1, F661.1.1, K542.1, K2135.1, A221.3, D16102.2, W245, H161.1, K19.5.3, D638.1, R195, J941, R131.1.3, T74.0.1. One can inspect these individually and compare them to the tmi and find that there are similar motifs that don’t quite match. For instance, there is no “F661.1.1”, but there is a “F661.11”, or “Skillful Archer Uses Arrow As Boomerang”.
2. Motifs are extracted from the tale summaries present in the ATU using the code here. An example of a tale summary without identified motif IDs is ATU 1342: “During a cold winter, a satyr (wood spirit) meets a man (boy) who is cold and accommodates him in his cave. The satyr watches the man blowing in his hand and is told that in this way he wants to warm his numb fingers. When the satyr serves up a meal, his guest blows on the food and explains that he wants to cool it. The satyr is afraid of this strange human behavior, blowing hot and cold in the same manner, and chases the man away.”
3. Recall that Yarlott and Finlayson (2016) counted “46,248 motifs and sub-motifs, 41,796 of which have references to tales or tale types.” While there is a difference in the total count of motifs, it is minimal, and it is unclear what they mean by sub-motifs.
4. With some extreme outliers, that is.
5. Note that when the suffix “ff.” (i.e. and following) is appended to a motif, we do not include all motifs following it, since it is unclear precisely what is intended by this convention.